





ISSN 0840-8440 



PROCEEDINGS 



TECHNOLOGY TRANSFER CONFERENCE 1988 
November 28 and 29, 1988 
Royal York Hotel 
Toronto, Ontario 



SESSION D 
ANALYTICAL METHODS 



Sponsored by 
Research and Technology Branch 
Environment Ontario 
Ontario, Canada 






Copyright Provisions and Restrictions on Copying: 

This Ontario Ministry of the Environment work is protected by Crown copyright 
(unless otherwise indicated), which is held by the Queen's Printer for Ontario. It 
may be reproduced for non-commercial purposes if credit is given and Crown 
copyright is acknowledged. 

It may not be reproduced, in all or in part, for any commercial purpose except 
under a licence from the Queen's Printer for Ontario. 

For information on reproducing Government of Ontario works, please contact 
ServiceOntario Publications at copyright@ontario.ca 






DP11 

ROBUSTNESS OF SIMPLE HYPOTHESIS TESTING METHODS WITH CENSORED 
ENVIRONMENTAL QUALITY DATA 

E.E. Creese 

Creese Environmental Consulting, 

P.O. Box 91, Waterloo, Ontario 

N2J 3Z6 

INTRODUCTION 

Frequently, when a chemical parameter of environmental concern (e.g. chlorinated 
organics or heavy metals) is measured, a sample will contain some observations below the 
analytical detection limit. Such a sample is termed a censored data set. 

Previous studies (Gilliom & Helsel, 1986; Helsel & Gilliom, 1986) have shown that, of 
the methods that they investigated, log-probability regression to the curve of a normal 
distribution was the most robust for estimating population parameters such as the mean and 
standard deviation of parent populations of environmental data. This work takes as a 
starting point the assumption that the lognormal distribution is the most likely parent 
distribution of chemical water quality data. It endeavors to assess the reliability of simple 
hypothesis testing, such as the z test or the t test, when population or sample parameters are 
estimated by log-probability regression. 

The basic procedure is to generate a large number of simulated samples, all with a specified 
number of observations. Censoring is then performed at a defined cut-off point. 
The mean and variance of the population (or sample, depending on the method) are 
estimated by log-probability regression. A hypothesis test is then performed using the 
estimated mean and variance to determine whether the sample mean can be inferred to be 
equal to the population mean. Because the population mean is known, which is the whole 
point of using simulated data, one can say whether or not the statistical test failed. Such a 
failure, that of rejecting a true null hypothesis, is termed a type I error. This work uses 
Monte Carlo methods to compare nominal type I error rates with actual type I error rates. 
At present, several regression methods are being compared, but only one is reported on 
here. Because the work is still in progress, methods are the main topic of this paper and 
only preliminary results are available. 

METHODS 

The work is being carried out on a Macintosh Plus microcomputer with a 20 megabyte hard 
drive. Random number generation and statistical routines are performed with the Systat 
application. All the Systat calculations are done in double precision, an automatic feature of 
most Systat routines. Iterative Systat programs are assembled by a HyperCard program and 
submitted for batch execution to Systat using the Macromaker utility. Macromaker returns 
control to HyperCard, which then composes the next set of Systat programs, and so on. 

Generation of random numbers 

The algorithm used by Systat for the generation of random numbers is given in Wichmann & 
Hill (1982). It has a cycle length of $2.78 \times 10^{13}$. On opening the Systat DATA module, 
the random number generator always starts at the same point in the cycle. A seed number 
between 1 and 30,000 can be entered to make the generator start at a different point. This, 
in effect, reduces the cycle to 30,000, supposing, of course, that the choice of the seed is 
random. For this reason, a batch of 10,000 random numbers with the required probability 
distribution was produced initially in a single session in the DATA module. When, say, 
another 10,000 are required, the Systat DATA module will be opened and 20,000 random 
numbers will be produced, of which the first 10,000 will be discarded. 
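
The generator and the discard procedure described above can be sketched as follows. The three multipliers and moduli are those of Algorithm AS 183; the default seed values, the class name `WichmannHill` and the helper `fresh_batch` are hypothetical, and how Systat maps its single seed number onto the generator state is not stated here, so this is an illustration rather than a reproduction of Systat's stream.

```python
class WichmannHill:
    """Sketch of the Wichmann & Hill (1982) AS 183 combined generator."""

    def __init__(self, s1=1, s2=1, s3=1):
        # Assumption: arbitrary default seeds; AS 183 expects each seed
        # to lie between 1 and 30,000.
        self.s1, self.s2, self.s3 = s1, s2, s3

    def random(self):
        # Advance the three multiplicative congruential generators.
        self.s1 = (171 * self.s1) % 30269
        self.s2 = (172 * self.s2) % 30307
        self.s3 = (170 * self.s3) % 30323
        # The fractional part of the sum is the uniform variate.
        return (self.s1 / 30269 + self.s2 / 30307 + self.s3 / 30323) % 1.0


def fresh_batch(batch_index, batch_size=10_000):
    """Return batch number `batch_index` (0-based) from the fixed start point.

    The generator is restarted, every number used by earlier batches is
    discarded, and the next `batch_size` numbers are kept.
    """
    gen = WichmannHill()
    for _ in range(batch_index * batch_size):
        gen.random()  # discard numbers already used
    return [gen.random() for _ in range(batch_size)]
```

Calling `fresh_batch(1)` corresponds to opening a new session, producing 20,000 numbers and keeping only the second 10,000, so that no part of the cycle is used twice.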

Generation of random samples 

As mentioned above, a set of 10,000 lognormally distributed random numbers, $x_i$, was 
produced. The population median was set to $\nu_x = 1$ and the coefficient of variation to 
$cv_x = 1$. The first step in generating the numbers was to generate the corresponding 
normally distributed numbers, $y_i$, where $x_i = \exp[y_i]$. Normal random samples were 
simulated using the equation: 

$y_i = \mu_y + \sigma_y \epsilon_i$ 

In this equation, $\epsilon$ is the error term, which is a normally distributed random variate 
with a mean of 0 and a variance of 1. It was provided by the Systat random number 
generator. It is apparent that: 

$\mu_y = \nu_y = \ln[\nu_x] = 0.$ 

The relationship between $\sigma_y$ and $cv_x$ is given by Aitchison & Brown (1957): 

$cv_x^2 = \exp[\sigma_y^2] - 1.$ 

Samples of size n = 10 were abstracted from the set thus generated, the first 10 
numbers becoming the first sample, the next 10 becoming the second sample, etc. 
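
A minimal sketch of this generation step is given below. It assumes Python's standard `random` module as a stand-in for the Systat generator, so the numbers themselves will differ from those used in this work; the function name `lognormal_samples` and the seed value are hypothetical.

```python
import math
import random


def lognormal_samples(total=10_000, n=10, median=1.0, cv=1.0, seed=1):
    """Generate `total` lognormal numbers and cut them into samples of size n."""
    rng = random.Random(seed)  # assumption: stand-in for the Systat generator
    mu_y = math.log(median)  # mean of y = ln(x); 0 when the median is 1
    # Aitchison & Brown relation: cv^2 = exp(sigma_y^2) - 1
    sigma_y = math.sqrt(math.log(1.0 + cv ** 2))
    # y_i = mu_y + sigma_y * eps_i, then x_i = exp(y_i)
    x = [math.exp(mu_y + sigma_y * rng.gauss(0.0, 1.0)) for _ in range(total)]
    # First n numbers form the first sample, the next n the second, etc.
    return [x[i:i + n] for i in range(0, total, n)]


samples = lognormal_samples()
```

With the median and coefficient of variation both equal to 1, the standard deviation of the logarithms is the square root of ln 2, about 0.833.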

Censoring 

Four censoring points were chosen to correspond to the 20th, 40th, 60th and 80th 
percentiles of the parent population. These correspond to four hypothetical detection limits, 
$x_D$, of chemical analysis. They are calculated from the normal distribution as follows: 

$x_D = \exp[\mu_y + z[p]\,\sigma_y] \mid p \in \{.2, .4, .6, .8\}$ 

Table 1. Detection limits, $x_D$, corresponding to the 20th, 40th, 60th and 80th 
percentiles of a lognormally distributed quantity with median value and coefficient of 
variation both equal to 1. 

Percentile    $x_D$ 
20            0.496 
40            0.810 
60            1.235 
80            2.015 


Baseline Check 

The set of 10,000 random numbers was tested for compliance with the population 
parameters that defined it. This was done to ensure that the computer programs were 
written and executed according to plan. To accomplish this, 1,000 samples of n = 10 were 
tested in turn by a Student's t test in the Systat module STATS against the true hypothesis 
that $\mu_y = 0$. Unlike SAS or SPSS, which determine the probabilities corresponding to t 
by interpolating from a table, Systat calculates the probability, resulting in greater accuracy. 
The algorithm for this calculation is described in Lund & Lund (1983). 
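
A baseline check of this kind might be sketched as below. The sketch is an assumption-laden stand-in, not the Systat procedure: it compares the t statistic with a hard-coded two-sided 5% critical value for 9 degrees of freedom (2.262, a standard table value) instead of computing the probability, and it uses Python's `random` module for the normal variates. The names `t_test_rejects` and `baseline_error_rate` are hypothetical.

```python
import math
import random
from statistics import mean, stdev

# Assumption: table value of Student's t, two-sided 5%, 9 degrees of freedom.
T_CRIT_DF9 = 2.262


def t_test_rejects(y, mu0=0.0, t_crit=T_CRIT_DF9):
    """True if the one-sample t test rejects the hypothesis mean(y) = mu0."""
    n = len(y)
    t = (mean(y) - mu0) / (stdev(y) / math.sqrt(n))
    return abs(t) > t_crit


def baseline_error_rate(m=1_000, n=10, sigma_y=math.sqrt(math.log(2.0)), seed=1):
    """Fraction of m uncensored samples of ln(x) for which a true null is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(m):
        y = [sigma_y * rng.gauss(0.0, 1.0) for _ in range(n)]  # mu_y = 0
        rejections += t_test_rejects(y)
    return rejections / m
```

Because the null hypothesis is true by construction, the returned fraction is an estimate of the actual type I error rate and should lie near the nominal 5%.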

The results of this check are shown in Table 2. At a nominal type I error rate of $\alpha = 5\%$, 
it is to be expected that the t test will fail 5% of the time. Table 2 shows that the actual type I 
error rate was a = 4.9%, which is well within the 99.9% confidence limits of the nominal 
type I error rate. The 99.9% confidence interval of $\alpha$ was determined from the following 
expression: 

$\alpha \pm z[.9995]\,(\alpha(1-\alpha)/m)^{0.5}$ 

where m is the number of tests, in this case, 1,000. 
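
The expression above translates directly into code. The sketch below assumes `statistics.NormalDist` for the normal quantile and takes m as the 1,000 tests named in the text; the function name `nominal_rate_interval` is hypothetical.

```python
import math
from statistics import NormalDist


def nominal_rate_interval(alpha, m, confidence):
    """Normal-approximation interval: alpha +/- z * sqrt(alpha * (1 - alpha) / m)."""
    # Two-sided interval, so the quantile is taken at 1 - (1 - confidence) / 2.
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z * math.sqrt(alpha * (1.0 - alpha) / m)
    return alpha - half_width, alpha + half_width


low, high = nominal_rate_interval(alpha=0.05, m=1_000, confidence=0.999)
print(f"{100 * low:.1f}% - {100 * high:.1f}%")
```

The half-width shrinks with the square root of m, so the interval is sensitive to the number of tests that is entered.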

Table 2. Results of 1,000 t tests performed on simulated environmental quality data. 
$\alpha$ is the nominal type I error rate and a is the actual type I error rate. 

$\alpha$    a       99.9% confidence interval of $\alpha$ 
5%          4.9%    4.8% - 5.2% 



Simulation Runs 

In these runs, simulated samples of n = 10 were ordered by rank. Then they were 
censored at one of the four points listed above in Table 1. Normal scores, z, were then 
computed for each of the remaining sample observations by: 

$z_i = z[R_i/(n+1)],$ 

where $R_i$ is the rank of the observation. Regression of ln[x] on z was then done according 
to the model, 

$\ln[x] = m_y + s_y z,$ 

to obtain estimates, $m_y$ and $s_y$, of the population mean, $\mu_y$, and standard deviation, 
$\sigma_y$. At higher levels of censoring, some samples had to be discarded, since regression 
could not be performed if fewer than two sample observations remained above $x_D$. After 
regression, hypothesis testing was done, choosing as the null hypothesis that the sample 
mean is equal to the population mean. Type I errors were counted and, at the end of each 
run, the actual type I error rate, a, was calculated. 
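
The regression step for a single sample might look like the sketch below. It assumes that ranks are assigned within the full sample before censoring, that only observations strictly above the detection limit enter the regression, and that ordinary least squares is used; the function name `log_probability_regression` is hypothetical, and `statistics.NormalDist` stands in for the normal score function z[.].

```python
import math
from statistics import NormalDist, mean


def log_probability_regression(sample, x_d):
    """Estimate (m_y, s_y) from the observations above the detection limit x_d.

    Returns None when fewer than two observations remain, in which case the
    sample has to be discarded.
    """
    n = len(sample)
    ranked = sorted(sample)
    # Normal score z_i = z[R_i / (n + 1)], paired with ln(x_i), for each
    # uncensored observation; R_i is the rank in the full sample.
    points = [(NormalDist().inv_cdf(rank / (n + 1)), math.log(x))
              for rank, x in enumerate(ranked, start=1) if x > x_d]
    if len(points) < 2:
        return None
    z_bar = mean(z for z, _ in points)
    y_bar = mean(y for _, y in points)
    s_zz = sum((z - z_bar) ** 2 for z, _ in points)
    s_zy = sum((z - z_bar) * (y - y_bar) for z, y in points)
    s_y = s_zy / s_zz           # slope: estimate of the standard deviation
    m_y = y_bar - s_y * z_bar   # intercept: estimate of the mean
    return m_y, s_y
```

The intercept is the fitted value of ln[x] at z = 0, which is why it serves as the estimate of the mean of the logarithms, while the slope plays the role of the standard deviation.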

RESULTS 

To date, very few data generating runs have been completed. So far hypothesis testing has 
been limited to a z test. A confidence interval for the appropriate probability, p, is 
constructed about the estimate of the population mean, thus: 

$m_y \pm z[p]\,s_y/\sqrt{n} \mid p = 1 - \alpha/2$ 

A type I error is indicated if the true population mean, $\mu_y$, falls outside the confidence 
interval. 
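
The decision rule and the tally of type I errors can be sketched as follows. The sketch assumes that n is the full sample size (10 here) and not the number of uncensored observations, which the text does not state explicitly; the function names `z_test_type_i_error` and `actual_error_rate` are hypothetical.

```python
import math
from statistics import NormalDist


def z_test_type_i_error(m_y, s_y, n, mu_y=0.0, alpha=0.05):
    """True if the true mean mu_y falls outside m_y +/- z[1 - alpha/2] * s_y / sqrt(n)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * s_y / math.sqrt(n)
    return not (m_y - half_width <= mu_y <= m_y + half_width)


def actual_error_rate(estimates, n, mu_y=0.0, alpha=0.05):
    """Fraction of retained samples whose z test commits a type I error.

    `estimates` is a list of (m_y, s_y) pairs, one per sample that was not
    discarded, so its length is the number of tests actually performed.
    """
    errors = sum(z_test_type_i_error(m, s, n, mu_y, alpha) for m, s in estimates)
    return errors / len(estimates)
```

For example, with n = 10 and $s_y = 0.8$ the half-width of the interval is about 0.496, so an estimate $m_y = 0.4$ does not produce a type I error while $m_y = 0.6$ does.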

Table 3. Results of z tests on 500 samples of sample size n = 10, performed on 
simulated environmental quality data. The nominal type I error rate is $\alpha = 5\%$. a is the 
actual type I error rate; m is the number of z tests actually performed. 

Censoring level    a        m      95% confidence interval of $\alpha$ 
0%                 4.?%     500    4.0% - 6.0% 
20%                6.2%     500    4.0% - 6.0% 
40%                12.8%    499    4.0% - 6.0% 
60%                26.7%    475    4.0% - 6.0% 
80%                50.5%    325    3.8% - 6.2% 



The trend seen in Table 3 is what would be expected: the ability of the z test to identify 
the correct hypothesis deteriorates as the level of censoring increases, so the actual type I 
error rate rises. At 80% censoring, the z test has lost all discriminating power; it is 
essentially a 50:50 chance whether or not it picks the correct hypothesis. 






DISCUSSION 

More preliminary experimentation is required before production of tables can begin. It will 
be necessary to know, for example, whether the absolute number as well as the proportion 
of observations above detection influences the actual type I error rate. 

The ultimate goal of this work is to generate tables so that one could perform a standard 
hypothesis test at a given nominal type I error rate and then look up what the actual type I 
error rate is. Alternatively, one could decide, given the degree of censoring, what nominal 
type I error rate to test a hypothesis at in order to obtain the desired actual type I error 
probability. Such tables would be of great use to practising environmental scientists. 

REFERENCES 

Aitchison, J. & J.A.C. Brown. 1957. The Lognormal Distribution with special reference 
to its uses in economics. Cambridge University Press. 

Gilliom, R.J. & D.R. Helsel. 1986. Estimation of distributional parameters for censored 
trace level water quality data. 1. Estimation techniques. Water Resources Research, 
22(2): 135-146. 

Helsel, D.R. & R.J. Gilliom. 1986. Estimation of distributional parameters for censored 
trace level water quality data. 2. Verification and applications. Water Resources 
Research, 22(2): 147-155. 

Lund, R.E. & J.R. Lund. 1983. Probabilities and upper quantiles for the studentized 
range. Algorithm AS 190. Applied Statistics, 32: 204-210. 

Wichmann, B.A. & I.D. Hill. 1982. An efficient and portable pseudo-random number 
generator. Algorithm AS 183. Applied Statistics, 31: 188-190. 












